Online Self-Indexed Grammar Compression

Authors

  • Yoshimasa Takabatake
  • Yasuo Tabei
  • Hiroshi Sakamoto
Abstract

Although several grammar-based self-indexes have been proposed, they apply only to offline settings in which the whole input text is available in advance, so the index structure must be rebuilt whenever additional input arrives, which is often the case in the big data era. In this paper, we present OESP-index, the first online self-indexed grammar compression, which builds the index structure incrementally by reading input characters one by one. This online property also reduces the working space needed for construction, because the input text does not have to be stored in memory. We experimentally evaluate OESP-index on building index structures and searching for query texts, and we show its efficiency, in particular its space efficiency during index construction.

Similar resources

Self-Indexed Grammar-Based Compression

Self-indexes aim at representing text collections in a compressed format that allows extracting arbitrary portions and also offers indexed searching on the collection. Current self-indexes are unable to fully exploit the redundancy of highly repetitive text collections that arise in several applications. Grammar-based compression is well suited to exploit such repetitiveness. We introduce th...

Self-indexed Text Compression Using Straight-Line Programs

Straight-line programs (SLPs) offer powerful text compression by representing a text T[1, u] in terms of a restricted context-free grammar of n rules, so that T can be recovered in O(u) time. However, the problem of operating the grammar in compressed form has not been studied much. We present a grammar representation whose size is of the same order as that of a plain SLP representation, and c...

Combining Text Compression and String Matching: The Miracle of Self-Indexing

This decade has witnessed the rise of what I consider the most important breakthrough of modern times in text compression and indexed string matching. Self-indexing is the mechanism by which a text is simultaneously compressed and indexed, so that the self-index occupies space close to that of the compressed text, provides random access to any part of it, and in addition supports efficient inde...

Universal Prediction for Indexed Classes of Sources

We study universal prediction w.r.t. an indexed class of sources (e.g., parametric families) and general loss functions. We explore the centrality of the self-information loss function (log-loss) in the theory of universal prediction by showing that under certain assumptions, the feasibility of universal prediction w.r.t. the log-loss function, over an indexed class of sources (that is, univers...

Online Grammar Compression for Frequent Pattern Discovery

Various grammar compression algorithms have been proposed in the last decade. A grammar compression is a restricted CFG that derives the string deterministically. An efficient grammar compression builds a smaller CFG by finding duplicated patterns and removing them. This process amounts to frequent pattern discovery by grammatical inference. While we can get any frequent pattern in linear time usi...

Publication date: 2015